About the Document

This document contains the data analysis tasks for the course project of Introduction to R Programming on the 365 Data Science platform.

About the Project

According to the project overview, the goal is to understand how the characteristics of the housing market influence other important characteristics, such as home value. In other words, using the available data, stakeholders want to make inferences and predictions about the housing market in the observed environment.

To do so, a dataset with the following variables is provided:

  • Crime.Rate: Local crime rate per capita
  • Average.Rooms: Average number of rooms in homes
  • Public.Transport.Access: Proximity to public transport
  • Number.of.Schools: Number of local schools
  • Median.Home.Value: Median value of homes

Dataset from a Bird’s-eye View

As the above list of variables shows, there are statistical measures, such as rate, average, and median (namely Crime.Rate, Average.Rooms, and Median.Home.Value), associated with each observation in the dataset. This suggests that each observation is related to a set of elements, housing units in this case, from which these statistical measures are calculated.

The variables that do not appear to be statistical measures, such as Public.Transport.Access and Number.of.Schools, are, in contrast, constant values that apply to all the elements in the set, that is, to all the housing units associated with each observation.

The observations therefore appear to represent areas of some kind, for example neighborhoods or blocks, though this is not stated explicitly. For the rest of the analysis, this aspect of the dataset is kept abstract, and the term housing area is used.

In that light, we have pre-collected data on housing areas, where each housing area has the following characteristics:

  • crime rate per capita of the area
  • average rooms of the housing units located in the area
  • median home value of the housing units located in the area
  • public transport access of the area
  • number of schools of the area

First Inspection and Uncertainties

Due to the lack of a detailed description of the provided dataset, there are a few uncertainties one might be concerned about. These uncertainties might turn out to be unimportant to the goal of the analysis, yet it is better to note them.

Let’s see if the first 10 rows of the dataset provide additional information or raise more questions…

##    Crime Rate Average Rooms Public Transport Access Number of Schools
## 1          NA      5.585324                      10                 3
## 2    2.654339      5.395206                       3                 6
## 3    4.619221      6.033965                       9                 4
## 4    6.807575      5.418335                      10                 5
## 5    2.414617      6.189320                       2                 4
## 6    2.414658      5.964833                       6                 4
## 7    6.948032      5.832736                       7                 4
## 8    4.918587      5.364705                       9                 4
## 9    1.826314      5.596260                       3                 6
## 10         NA      6.528774                      10                 4
##    Median Home Value
## 1           47.90077
## 2           41.53910
## 3           48.51757
## 4           42.50757
## 5           51.39125
## 6           49.64657
## 7           48.76959
## 8           38.21798
## 9           44.69063
## 10          52.59876

As already noted, the type of housing areas and housing units with which the observations are associated is unclear. We also do not know the currency or the scale in which home values (Median Home Value) are expressed. The unit in which the Public Transport Access variable is expressed is also unknown. Number of Schools seems to be an obvious measure; however, the type of schools these numbers refer to is not known. Crime Rate is also a vague indicator: even though the project describes it as a per capita measurement, one might be suspicious looking at the high rates in the above sample of the dataset.

As for the Public Transport Access variable, we could also think of it as a central measure over the housing units in the area. A constant value would suggest that the area encloses housing units that do not differ in physical location, while a central measure would suggest an aggregate over related yet physically spread-out housing units.

Until a more detailed investigation suggests otherwise, these concerns are considered resolved and handled as follows:

  • housing areas and the enclosed housing units remain unspecified
  • home value is assumed to be expressed in USD, yet:
    • the scale on which it is presented remains unspecified without seeing the entire data available
  • public transport access is probably expressed in minutes, however:
    • this is not guaranteed without seeing the entire data available
    • the type of transport remains unspecified
    • the routes of the transport remain unspecified
  • number of schools is an obvious discrete measure, yet:
    • the type of schools involved remains unspecified
    • the vacancy of the schools involved remains unspecified
  • average rooms is an obvious aggregated continuous measure
  • crime rate is per capita; however, its exact scale remains unspecified without seeing the entire data available

Exploring and Analysing the Dataset

The project guide advises exploring and analyzing the dataset to understand the overall structure and characteristics of the data by:

  • computing descriptive statistics
  • performing correlation analysis
  • identifying and handling missing values
# package imports related to all tasks
library(tidyverse)
library(magrittr)
library(ggcorrplot)
# seed
set.seed(1234)

The data is given as a CSV file called housing_data.csv. The following code loads the data into the object df.housing as a data frame, keeping the original column names intact.

# load dataset
df.housing <- read.csv("../data/housing_data.csv", check.names = FALSE)

The imported dataset confirms the previously introduced variables.

# original colnames

val.df.housing.original_colnames <- df.housing |>
  colnames()

val.df.housing.original_colnames |>
  print()
## [1] "Crime Rate"              "Average Rooms"          
## [3] "Public Transport Access" "Number of Schools"      
## [5] "Median Home Value"

The dataset has the following number of rows, that is, observations:

# number of rows
df.housing |>
  nrow()
## [1] 506

The dataset has the following structure:

# structure
df.housing |>
  str()
## 'data.frame':    506 obs. of  5 variables:
##  $ Crime Rate             : num  NA 2.65 4.62 6.81 2.41 ...
##  $ Average Rooms          : num  5.59 5.4 6.03 5.42 6.19 ...
##  $ Public Transport Access: int  10 3 9 10 2 6 7 9 3 10 ...
##  $ Number of Schools      : int  3 6 4 5 4 4 4 4 6 4 ...
##  $ Median Home Value      : num  47.9 41.5 48.5 42.5 51.4 ...

Before proceeding any further, the variable names are transformed into a form more suitable for the rest of the analysis.

# cleaning variable names

df.housing <- df.housing |>
  janitor::clean_names()

df.housing |>
  colnames()
## [1] "crime_rate"              "average_rooms"          
## [3] "public_transport_access" "number_of_schools"      
## [5] "median_home_value"

Also, since R does not have a built-in function for the statistical mode, the project guide advises implementing one.

# mode function
## returns mode(s) if there are no more than 3 modes in the given variable,
## otherwise returns NA

fn.mode <- function(x) {
  t <- table(x)
  mode <- names(t)[which(t == max(t))]

  ## max 3 modes in a variable
  if (length(mode) > 3) {
    mode <- NA
  }

  return(as.numeric(mode))
}
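As a quick sanity check, fn.mode can be exercised on a few toy vectors (the values are illustrative, not project data); the function is restated so the snippet runs on its own:

```r
# fn.mode restated from above so this snippet is self-contained
fn.mode <- function(x) {
  t <- table(x)
  mode <- names(t)[which(t == max(t))]
  if (length(mode) > 3) {
    mode <- NA
  }
  return(as.numeric(mode))
}

fn.mode(c(1, 2, 2, 3))    # single mode: 2
fn.mode(c(1, 1, 2, 2))    # two modes: 1 and 2
fn.mode(c(1, 2, 3, 4, 5)) # more than 3 modes: NA
```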

Overview of Variables

The following table shows which variables contain missing values:

# column-wise presence of NA

## function
fn.hasNA <- function(df) {
  any(is.na(df))
}

## table

out.presenceNA <- df.housing |>
  sapply(fn.hasNA)

names(out.presenceNA) <- paste(names(out.presenceNA), "has NA", sep = " ")

out.presenceNA |>
  print()
##              crime_rate has NA           average_rooms has NA 
##                           TRUE                           TRUE 
## public_transport_access has NA       number_of_schools has NA 
##                          FALSE                          FALSE 
##       median_home_value has NA 
##                          FALSE
# summary of variables
df.housing |>
  summary()
##    crime_rate        average_rooms   public_transport_access number_of_schools
##  Min.   : 0.005305   Min.   :4.112   Min.   : 1.000          Min.   : 0.000   
##  1st Qu.: 1.299938   1st Qu.:5.598   1st Qu.: 3.000          1st Qu.: 4.000   
##  Median : 3.031481   Median :6.033   Median : 5.000          Median : 5.000   
##  Mean   : 3.137415   Mean   :6.026   Mean   : 5.421          Mean   : 4.992   
##  3rd Qu.: 4.584798   3rd Qu.:6.460   3rd Qu.: 8.000          3rd Qu.: 6.000   
##  Max.   :12.631829   Max.   :7.801   Max.   :10.000          Max.   :10.000   
##  NA's   :25          NA's   :15                                               
##  median_home_value
##  Min.   :31.55    
##  1st Qu.:43.23    
##  Median :46.91    
##  Mean   :47.10    
##  3rd Qu.:50.85    
##  Max.   :62.56    
## 

The modes of the variables are, respectively:

# mode of variables
df.housing |>
  sapply(fn.mode)
##              crime_rate           average_rooms public_transport_access 
##                     0.1                      NA                     2.0 
##       number_of_schools       median_home_value 
##                     5.0                      NA

Crime Rate

Being a numerical type, this variable contains decimal values from approximately \(0\) to no more than \(13\). The mean is slightly larger than the median; that is, the frequency distribution of the variable is slightly stretched to the right, exhibiting right skewness. There are \(25\) observations with a missing value for this variable. The mode is \(0.1\); however, since this is a continuous variable, a binned mode shall be identified with the help of visualization.

The distribution of the variable, visualized in a histogram, using a binwidth of \(0.5\), is the following:

# histogram
df.housing |>
  ggplot() +
  geom_histogram(
    aes(crime_rate),
    binwidth = 0.5,
    boundary = 0.5,
    closed = "left",
    col = "purple",
    fill = "purple",
    alpha = 0.75
  ) +
  scale_x_continuous(breaks = seq(0, 13, by = 1)) +
  labs(
    title = "Frequency distribution of Crime Rate",
    subtitle = "bin intervals are left side closed",
    x = "Crime Rate",
    y = "Frequency"
  )

# boxplot
df.housing |>
  ggplot() +
  geom_boxplot(
    aes(crime_rate),
    fill = "cornflowerblue"
  ) +
  scale_x_continuous(
    breaks = seq(0, 13, by = 1)
  ) +
  labs(
    title = "Crime Rate intervals of equal parts of the dataset",
    x = "Crime Rate"
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )

To sum up:

  • presence of missing values (NAs)
  • numerical type
  • distributed within [0.005305, 12.631829]
  • median (3.031481) < mean (3.137415)
  • the exact mode is 0.1, in terms of bins the mode is [0, 0.5)
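The binned mode read off the histogram can also be computed directly by binning the values with cut() and taking the most frequent bin; a sketch on a toy vector (not the project data), using the same binwidth of 0.5 and left-closed intervals:

```r
# toy continuous values standing in for a crime-rate-like variable
x <- c(0.1, 0.2, 0.4, 0.7, 1.2, 1.3, 3.0)

# left-closed bins of width 0.5, matching the histogram settings
bins <- cut(x, breaks = seq(0, 13, by = 0.5), right = FALSE)

# the most frequent bin is the binned mode
binned_mode <- names(which.max(table(bins)))
binned_mode
```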

Average Rooms

Being a numerical type, this variable contains decimal values from \(4.112\) to approximately no more than \(8\). The mean is almost exactly the same as the median; that is, no skewness is expected. There are \(15\) observations with a missing value for this variable. There is no exact mode; however, since this is a continuous variable, a binned mode shall be identified with the help of visualization.

The distribution of the variable, visualized in a histogram, using a binwidth of \(0.25\), is the following:

# histogram
df.housing |>
  ggplot() +
  geom_histogram(
    aes(average_rooms),
    binwidth = 0.25,
    boundary = 0.25,
    closed = "left",
    col = "purple",
    fill = "purple",
    alpha = 0.75
  ) +
  scale_x_continuous(breaks = seq(0, 8, by = 0.25)) +
  labs(
    title = "Frequency distribution of Average Rooms",
    subtitle = "bin intervals are left side closed",
    x = "Average Rooms",
    y = "Frequency"
  )

To sum up:

  • presence of missing values (NAs)
  • numerical type
  • distributed within [4.112, 7.801]
  • median (6.033) ≈ mean (6.026)
  • there is no exact mode, in terms of bins the mode is [6, 6.25)

Public Transport Access

Being an integer type, this variable contains whole numbers from \(1\) to \(10\). The mean is larger than the median; that is, the frequency distribution of the variable is stretched to the right, exhibiting right skewness. There are no missing values related to this variable; the mode is \(2\).

The distribution of the variable, visualized in a histogram, using a binwidth of \(1\), is the following:

# histogram
df.housing |>
  ggplot() +
  geom_histogram(
    aes(public_transport_access),
    binwidth = 1,
    boundary = 1,
    closed = "left",
    col = "purple",
    fill = "purple",
    alpha = 0.75
  ) +
  scale_x_continuous(
    limits = c(1, 11),
    breaks = seq(0, 11, by = 1)
  ) +
  labs(
    title = "Frequency distribution of Public Transport Access",
    subtitle = "bin intervals are left side closed",
    x = "Public Transport Access",
    y = "Frequency"
  )

To sum up:

  • no missing values (no NAs)
  • integer type
  • distributed within [1, 10]
  • median (5) < mean (5.421)
  • mode is 2

Number of Schools

Being an integer type, this variable contains whole numbers from \(0\) to \(10\). The mean is almost exactly the same as the median; that is, no skewness is expected. There are no missing values related to this variable. The mode is \(5\).

The distribution of the variable, visualized in a histogram, using a binwidth of \(1\), is the following:

# histogram
df.housing |>
  ggplot() +
  geom_histogram(
    aes(number_of_schools),
    binwidth = 1,
    boundary = 1,
    closed = "left",
    col = "purple",
    fill = "purple",
    alpha = 0.75
  ) +
  scale_x_continuous(
    limits = c(0, 11),
    breaks = seq(0, 11, by = 1)
  ) +
  labs(
    title = "Frequency distribution of Number of Schools",
    subtitle = "bin intervals are left side closed",
    x = "Number of Schools",
    y = "Frequency"
  )

To sum up:

  • no missing values (no NAs)
  • integer type
  • distributed within [0, 10]
  • median (5) ≈ mean (4.992)
  • mode is 5

Median Home Value

Being a numerical type, this variable contains decimal values from \(31.55\) to \(62.56\). The mean is slightly larger than the median; that is, the frequency distribution of the variable is slightly stretched to the right, exhibiting right skewness. There are no missing values related to this variable, nor does an exact mode exist; however, since this is a continuous variable, a binned mode shall be identified with the help of visualization.

The distribution of the variable, visualized in a histogram, using a binwidth of \(2\), is the following:

# histogram
df.housing |>
  ggplot() +
  geom_histogram(
    aes(median_home_value),
    binwidth = 2,
    boundary = 2,
    closed = "left",
    col = "purple",
    fill = "purple",
    alpha = 0.75
  ) +
  scale_x_continuous(
    breaks = seq(0, 65, by = 2)
  ) +
  labs(
    title = "Frequency distribution of Median Home Value",
    subtitle = "bin intervals are left side closed",
    x = "Median Home Value",
    y = "Frequency"
  )

To sum up:

  • no missing values (no NAs)
  • numerical type
  • distributed within [31.55, 62.56]
  • median (46.91) < mean (47.10)
  • there is no exact mode, in terms of bins the mode is [46, 48)

Revisiting and Addressing Concerns

The following details, which were not available at the start of the project (the project withheld this information until submission, so it can be viewed as delayed information about the dataset), might shed light on some of the uncertainties and help resolve the concerns stated previously.

A housing area represents a neighborhood and public transport access indicates the number of buses passing through the neighborhood within an hour.

Looking at the distribution of the variables, we can settle on the scale on which the median home value is measured: 1,000 USD.

The scale of the crime rate, however, remains vaguely specified, as the following partially resolved concerns show.

A crime rate is a standardized measure for comparing the number of crimes in areas that differ in population. It is usually expressed as the number of crimes projected onto a population of 1,000 or 100,000. For example, two areas with crimes occurring 100 and 500 times a year in populations of 10,000 and 50,000, respectively, result in the same crime rate of 10 per 1,000 people.
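The arithmetic of the example can be checked in one line (numbers taken from the example above, not from the dataset):

```r
# crimes per year and population of the two example areas
crimes     <- c(100, 500)
population <- c(10000, 50000)

# crime rate projected onto a population of 1,000 people
rate_per_1000 <- crimes / population * 1000
rate_per_1000 # both areas: 10
```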

The project does not specify the number of people the crime rate is expressed for; however, the histogram and boxplot of the crime rate might help. The boxplot indicates that more than 75% of the observations are above 1. If the rate were truly per capita, more than 75% of the observed housing areas would exhibit more than one crime per resident. Unless the data deliberately covers areas with extremely high crime (which is not the case), the crime rate must be expressed for more than one person, such as per 1,000 or per 100,000.

The project also does not specify the type of crime measured by the crime rate variable. It could mean 0 to 13 crimes per 1,000 or per 100,000 people, depending on the seriousness of the crimes involved in this measurement.

Correlation Analysis

The project advises looking at whether there is a correlation between any of the variables.

A correlation matrix of the variables is computed and visualized as follows:

# correlation matrix

## matrix
cor(df.housing, use = "complete.obs")
##                         crime_rate average_rooms public_transport_access
## crime_rate              1.00000000   0.109411375             0.014246404
## average_rooms           0.10941138   1.000000000            -0.003768297
## public_transport_access 0.01424640  -0.003768297             1.000000000
## number_of_schools       0.02442190   0.005000546             0.035876982
## median_home_value       0.09161033   0.888351070             0.010709022
##                         number_of_schools median_home_value
## crime_rate                    0.024421905       0.091610332
## average_rooms                 0.005000546       0.888351070
## public_transport_access       0.035876982       0.010709022
## number_of_schools             1.000000000       0.004667281
## median_home_value             0.004667281       1.000000000
## visualization
df.housing |>
  ## use original column names
  rename_with(~val.df.housing.original_colnames) |>
  cor(use = "complete.obs") |>
  ggcorrplot(
    colors = c("black", "white", "purple"),
    method = "circle",
    type = "upper"
  ) +
  labs(
    title = "Correlation of Housing Market data variables"
  )

## coefficient of correlation
val.cor.avg_rooms_median_home_value <- cor(df.housing$average_rooms, df.housing$median_home_value, use = "complete.obs")

Other than a few very weak correlations between the variables, only one stands out significantly: the correlation between median home value and average rooms, with a correlation coefficient of \(0.8884\).

The following scatterplot also shows how strongly these two variables are positively correlated.

# scatterplot of the correlated variable pair
df.housing |>
  ggplot() +
  geom_point(
    aes(average_rooms, median_home_value),
    shape = 21,
    size = 2.5,
    col = "cornflowerblue"
  ) +
  labs(
    title = "Scatterplot of Average Rooms and Median Home Value",
    x = "Average Rooms",
    y = "Median Home Value"
  )

Last but not least, the coefficient of determination is:

# coefficient of determination
val.cor.avg_rooms_median_home_value |>
  raise_to_power(2) |>
  print()
## [1] 0.7915119

That is, around \(79\%\) of the variability of median home value is accounted for by average rooms.
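Equivalently, the squared correlation coefficient matches the R-squared of a simple linear regression of one variable on the other; a sketch on simulated toy data (not the project data):

```r
set.seed(1)

# toy data: y depends linearly on x plus noise
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# coefficient of determination two ways
r_squared_from_cor <- cor(x, y)^2
r_squared_from_lm  <- summary(lm(y ~ x))$r.squared

all.equal(r_squared_from_cor, r_squared_from_lm) # TRUE
```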

Handling Missing Values

Missing values introduce strong uncertainty into the picture. The way we handle missing values might depend on the number of such data points or the distribution of the affected variables.

One way to deal with them is to leave them out of subsequent analyses. In that case, however, one completely ignores data points that might otherwise be useful. If we replace missing values instead, we need to specify the replacement values and be careful about the fabricated values we introduce into subsequent analyses.
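The two options can be contrasted on a toy vector (illustrative values only): complete-case deletion shrinks the sample, while median imputation keeps every row at the cost of fabricating values at the median.

```r
x <- c(1, 2, NA, 4, 100, NA)

# option 1: drop the missing data points entirely
x_dropped <- x[!is.na(x)]
length(x_dropped) # 4 values remain out of 6

# option 2: replace missing values with the median of the observed values
x_imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)
x_imputed # NAs replaced by 3, all 6 values kept
```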

As the previous sections indicate, there are two variables with missing values: crime rate and average rooms. The following scatterplots and proportion visualizations show where and how often these values are missing.

# displaying median home values for which crime rate is missing
df.housing |>
  mutate(
    crime_rate = ifelse(is.na(crime_rate), -1, crime_rate)
  ) |>
  ggplot() +
  geom_point(
    aes(crime_rate, median_home_value, col = crime_rate == -1),
    shape = 21,
    size = 2.5
  ) +
  scale_color_manual(values = c("cornflowerblue", "red")) +
  labs(
    x = "Crime Rate",
    y = "Median Home Value",
    col = "X is missing"
  )

# displaying median home values for which average rooms is missing
df.housing |>
  mutate(
    average_rooms = ifelse(is.na(average_rooms), -1, average_rooms)
  ) |>
  ggplot() +
  geom_point(
    aes(average_rooms, median_home_value, col = average_rooms == -1),
    shape = 21,
    size = 2.5
  ) +
  scale_color_manual(values = c("cornflowerblue", "red")) +
  labs(
    x = "Average Rooms",
    y = "Median Home Value",
    col = "X is missing"
  )

# missing value proportions

## crime rate
df.housing |>
  group_by(missing = is.na(crime_rate)) |>
  count() |>
  ungroup() |>
  ggplot() +
  geom_bar(
    aes("x", n, fill = factor(missing)),
    stat = "identity",
    position = "fill"
  ) +
  scale_fill_manual(values = c("cornflowerblue", "coral")) +
  labs(
    x = "Crime Rate",
    y = "Fraction",
    fill = "Missing data"
  ) +
  theme(
    axis.ticks.x = element_blank(),
    axis.text.x = element_blank()
  )

## average rooms
df.housing |>
  group_by(missing = is.na(average_rooms)) |>
  count() |>
  ungroup() |>
  ggplot() +
  geom_bar(
    aes("x", n, fill = factor(missing)),
    stat = "identity",
    position = "fill"
  ) +
  scale_fill_manual(values = c("cornflowerblue", "coral")) +
  labs(
    x = "Average Rooms",
    y = "Fraction",
    fill = "Missing data"
  ) +
  theme(
    axis.ticks.x = element_blank(),
    axis.text.x = element_blank()
  )

The proportion of missing data in each of the two variables appears to be very small. Without delving into the numerous ways missing data could be replaced, median imputation will be used in the above cases.

# missing crime rate replacement with median
df.housing |>
  mutate(
    crime_rate_replaced = is.na(crime_rate)
  ) |>
  mutate(
    crime_rate = ifelse(crime_rate_replaced,
      median(crime_rate, na.rm = TRUE),
      crime_rate
    )
  ) |>
  ggplot() +
  geom_point(
    aes(crime_rate, median_home_value, col = crime_rate_replaced),
    shape = 21,
    size = 2.5
  ) +
  scale_color_manual(values = c("cornflowerblue", "red")) +
  labs(
    x = "Crime Rate",
    y = "Median Home Value",
    col = "Imputation"
  )

# missing average rooms replacement with median
df.housing |>
  mutate(
    average_rooms_replaced = is.na(average_rooms)
  ) |>
  mutate(
    average_rooms = ifelse(average_rooms_replaced,
      median(average_rooms, na.rm = TRUE),
      average_rooms
    )
  ) |>
  ggplot() +
  geom_point(
    aes(average_rooms, median_home_value, col = average_rooms_replaced),
    shape = 21,
    size = 2.5
  ) +
  scale_color_manual(values = c("cornflowerblue", "red")) +
  labs(
    x = "Average Rooms",
    y = "Median Home Value",
    col = "Imputation"
  )

It appears that the imputation does not visibly affect the crime rate variable, and as for average rooms, only one or two odd points are introduced.

# summary before replacement
df.housing |>
  select(crime_rate, average_rooms) |>
  summary()
##    crime_rate        average_rooms  
##  Min.   : 0.005305   Min.   :4.112  
##  1st Qu.: 1.299938   1st Qu.:5.598  
##  Median : 3.031481   Median :6.033  
##  Mean   : 3.137415   Mean   :6.026  
##  3rd Qu.: 4.584798   3rd Qu.:6.460  
##  Max.   :12.631829   Max.   :7.801  
##  NA's   :25          NA's   :15
# replacement

val.df.housing.crime_rate.median <- median(df.housing$crime_rate, na.rm = TRUE)
val.df.housing.average_rooms.median <- median(df.housing$average_rooms, na.rm = TRUE)

df.housing.imputed <- df.housing |>
  mutate(
    crime_rate_replaced = is.na(crime_rate),
    average_rooms_replaced = is.na(average_rooms)
  ) |>
  mutate(
    crime_rate = ifelse(crime_rate_replaced,
      val.df.housing.crime_rate.median,
      crime_rate
    ),
    average_rooms = ifelse(average_rooms_replaced,
      val.df.housing.average_rooms.median,
      average_rooms
    )
  )

# summary after replacement
df.housing.imputed |>
  select(crime_rate, average_rooms) |>
  summary()
##    crime_rate        average_rooms  
##  Min.   : 0.005305   Min.   :4.112  
##  1st Qu.: 1.375937   1st Qu.:5.605  
##  Median : 3.031481   Median :6.033  
##  Mean   : 3.132181   Mean   :6.026  
##  3rd Qu.: 4.533860   3rd Qu.:6.451  
##  Max.   :12.631829   Max.   :7.801

Also, the correlation of average rooms and median home value remains almost the same.

# correlation of average rooms and median home value

val.cor.avg_rooms_median_home_value |>
  print()
## [1] 0.8896695
cor(df.housing.imputed$average_rooms, df.housing.imputed$median_home_value, use = "complete.obs")
## [1] 0.8711099

Hypothesis Test

One of the tasks of the project is to perform a hypothesis test regarding the impact of crime rates on median home values.

The objective is to determine whether there is a statistically significant difference in the median home values of neighborhoods with high crime rates compared to those with low crime rates.

The null and alternative hypotheses are stated as follows:

\[ \begin{split} H_0:\text{There is no difference between the median home values of high and low crime-rate neighborhoods.} \\ H_A:\text{There is a difference between the median home values of high and low crime-rate neighborhoods.} \end{split} \]

The hypotheses can be expressed using the population means (\(\mu_L\) for low crime-rate areas, \(\mu_H\) for high crime-rate areas) as follows:

\[ \begin{split} H_0: \mu_L - \mu_H = 0 \\ H_A: \mu_L - \mu_H \neq 0 \end{split} \]

According to the above hypotheses, the test is a two-sided test, where the significance level and confidence level are, respectively, the following:

\[ \begin{split} \alpha = 5\% = 0.05 \\ CI = (1-\alpha) = 95\% = 0.95 \end{split} \]

Since we need two samples to estimate the respective populations, the given dataset shall be split in two based on the crime rate. First, the median crime rate is designated as the threshold (\(T\)) at which the two crime rate categories are separated.

\[ \begin{split} T = \operatorname{median}(\text{Crime Rate}) \\ \text{Crime Category}(x) = \begin{cases} \text{High}, & \text{if } x \ge T, \\ \text{Low}, & \text{otherwise}. \end{cases} \end{split} \]

Refine Imputation

Before proceeding further with the hypothesis test, we must address the fact that the missing crime rate values have been replaced by the median. If we designate the median as the threshold, we end up placing all the imputed (previously missing) values into the high crime category.

Given the uncertainty of the missing values, we should refine the way we impute these crime rate values. It is better to still use the median as the base replacement value, yet increase one half of the data points in question, and decrease the other half, by an insignificant amount. So instead of inserting them all with the same value, which would place all of them into one crime category, the imputation scatters the values around the median.

# revisiting summaries of crime rate

## original
df.housing |>
  select(crime_rate) |>
  summary()
##    crime_rate       
##  Min.   : 0.005305  
##  1st Qu.: 1.299938  
##  Median : 3.031481  
##  Mean   : 3.137415  
##  3rd Qu.: 4.584798  
##  Max.   :12.631829  
##  NA's   :25
## after median imputation
df.housing.imputed |>
  select(crime_rate) |>
  summary()
##    crime_rate       
##  Min.   : 0.005305  
##  1st Qu.: 1.375937  
##  Median : 3.031481  
##  Mean   : 3.132181  
##  3rd Qu.: 4.533860  
##  Max.   :12.631829
# refine imputation

## previous values of imputation (list of median)
val.imputation.values <- df.housing.imputed$crime_rate[df.housing.imputed$crime_rate_replaced == TRUE]

## generate deviates
val.imputation.deviates <- c(
  runif(
    val.imputation.values |> length() |> divide_by(2) |> add(1),
    -5e-3,
    0
  ),
  runif(
    val.imputation.values |> length() |> divide_by(2) |> add(1),
    0,
    5e-3
  )
) |>
  sample(val.imputation.values |> length())

## new version of the dataset
df.housing.imputed.refined <- df.housing.imputed

## update affected data points by deviates
df.housing.imputed.refined$crime_rate[df.housing.imputed.refined$crime_rate_replaced == TRUE] <- val.imputation.values |>
  add(val.imputation.deviates)

## visualization of imputations and the original median
df.housing.imputed.refined |>
  filter(crime_rate_replaced == TRUE) |>
  ggplot() +
  geom_point(
    aes(crime_rate, median_home_value, col = (crime_rate < val.df.housing.crime_rate.median)),
    size = 4,
    alpha = 0.75
  ) +
  geom_vline(xintercept = val.df.housing.crime_rate.median) +
  scale_color_manual(values = c("purple", "coral")) +
  labs(
    x = "Crime Rate",
    y = "Median Home Value",
    col = "CR < median"
  )

## proportion of imputations on each side of the original median
df.housing.imputed.refined |>
  filter(crime_rate_replaced == TRUE) |>
  mutate(
    imputation_side = crime_rate >= val.df.housing.crime_rate.median
  ) |>
  count(imputation_side) |>
  pull(n) |>
  prop.table()
## [1] 0.48 0.52
# after refined imputation

## summary
df.housing.imputed.refined |>
  select(crime_rate) |>
  summary()
##    crime_rate       
##  Min.   : 0.005305  
##  1st Qu.: 1.375937  
##  Median : 3.031581  
##  Mean   : 3.132169  
##  3rd Qu.: 4.533860  
##  Max.   :12.631829
## visualization
df.housing.imputed.refined |>
  ggplot() +
  geom_point(
    aes(crime_rate, median_home_value, col = crime_rate_replaced),
    shape = 21,
    size = 2.5
  ) +
  scale_color_manual(values = c("cornflowerblue", "red")) +
  labs(
    x = "Crime Rate",
    y = "Median Home Value",
    col = "Refined Imputation"
  )

Continue the Test

# separate sample into two, based on two crime categories

## threshold
val.crime_category_threshold <- val.df.housing.crime_rate.median

## add new variable
df.housing.imputed.refined <- df.housing.imputed.refined |>
  mutate(
    crime_category = ifelse(crime_rate >= val.crime_category_threshold,
      "High",
      "Low"
    )
  ) |>
  mutate(
    crime_category = fct(crime_category)
  ) |>
  relocate(
    crime_category,
    .after = crime_rate
  )

Since the observations had been labeled using the median crime rate value as a threshold, the two categories are nearly equally proportioned.

# proportions of the two categories
df.housing.imputed.refined |>
  select(crime_category) |>
  table() |>
  prop.table() |>
  round(3)
## crime_category
##  High   Low 
## 0.502 0.498

As a visual aid, the observations and variables related to the hypothesis test can be plotted as follows:

# hypothesis related scatter plot
df.housing.imputed.refined |>
  ggplot() +
  geom_point(
    aes(crime_rate, median_home_value, col = crime_category),
    shape = 21,
    size = 2.5
  ) +
  scale_color_manual(values = c("cornflowerblue", "coral")) +
  labs(
    title = "Median Home Value over Crime Rate",
    subtitle = "points are distinguished based on their Crime Category",
    x = "Crime Rate",
    y = "Median Home Value",
    col = "Crime Category"
  )

To decide on what type of test to carry out, the following shall be considered:

  • the available dataset is only a sample, which has been labeled by crime categories based on low / high crime rates
  • the parameters of each population (\(\mu\), \(\sigma^2\)) are unknown
    • however, a normal distribution and equal variances are assumed
  • the hypotheses refer to the difference of the population parameters
    • however, each sub-sample has enough elements (\(n>30\)) to estimate the population parameters

The project guide advises using a t-test, which would use the sample statistics to estimate the population parameters and would utilize a t-distribution with 504 degrees of freedom. However, a t-distribution with that many degrees of freedom is very close to the standard normal distribution, as depicted in the following diagram.

# degrees of freedom

val.degrees_of_freedom <- df.housing.imputed.refined |>
  nrow() |>
  subtract(2)

val.degrees_of_freedom |>
  print()
## [1] 504
# n vs. t distribution
data.frame(x = c(-3, 3)) |>
  ggplot(aes(x)) +
  stat_function(fun = dnorm, lwd = 3, col = "coral") +
  stat_function(fun = dt, args = list(df = val.degrees_of_freedom), lwd = .75) +
  geom_label(label = "~N(0, 1)", x = -1, y = .25, fill = "white", col = "coral", size = 5) +
  geom_label(label = "df = 504", x = 1, y = .25, size = 5)

In that light, I decided to proceed as follows:

  1. calculate and evaluate p-value in a z-test, using sample statistics as estimators and normal distribution
  2. calculate and evaluate p-value in a t-test, using sample statistics as estimators and t-distribution (df = 504)
  3. compare the results of the two approaches

Z-test

The z-test formula for a one-sample and a two-sample test, respectively:

\[ \begin{split} Z = \frac{val - EV}{SE} \\ \\ Z = \frac{diff - EV}{SE_{diff}} = \frac{diff - EV}{\sqrt{SE_1^2 + SE_2^2}} \end{split} \]

The standard error, using the sample standard deviation:

\[ SE = \frac{s}{\sqrt{n}} \]

Using the notations of the sample means and standard deviations, the following formula is established:

\[ Z = \frac{|\overline{x}_L - \overline{x}_H| - 0}{\sqrt{\frac{s_L^2}{n_L} + \frac{s_H^2}{n_H}}} \]

Using R to calculate the z-score:

# rownum of crime categories, mean and standard deviation of their target variable

mtx.samples_n_mean_sd <- c(
  df.housing.imputed.refined |>
    filter(crime_category == "Low") |>
    nrow(),
  df.housing.imputed.refined |>
    filter(crime_category == "Low") |>
    pull(median_home_value) |>
    mean(),
  df.housing.imputed.refined |>
    filter(crime_category == "Low") |>
    pull(median_home_value) |>
    sd(),
  df.housing.imputed.refined |>
    filter(crime_category == "High") |>
    nrow(),
  df.housing.imputed.refined |>
    filter(crime_category == "High") |>
    pull(median_home_value) |>
    mean(),
  df.housing.imputed.refined |>
    filter(crime_category == "High") |>
    pull(median_home_value) |>
    sd()
) |>
  matrix(nrow = 2, ncol = 3, byrow = TRUE)

colnames(mtx.samples_n_mean_sd) <- c("n", "Mean", "Standard deviation")
rownames(mtx.samples_n_mean_sd) <- c("Low crime category", "High crime category")

mtx.samples_n_mean_sd |>
  print()
##                       n     Mean Standard deviation
## Low crime category  252 46.74135           5.230683
## High crime category 254 47.46291           5.722151
# z-score calculation
z.score <- (
  (mtx.samples_n_mean_sd[1, "Standard deviation"]^2 / mtx.samples_n_mean_sd[1, "n"]) |>
    add((mtx.samples_n_mean_sd[2, "Standard deviation"]^2 / mtx.samples_n_mean_sd[2, "n"]))
) |>
  sqrt() |>
  raise_to_power(-1) |>
  multiply_by(
    abs(mtx.samples_n_mean_sd[1, "Mean"] |>
      subtract(mtx.samples_n_mean_sd[2, "Mean"])) |>
      subtract(0)
  )

The z-score is the following:

# z-score
z.score |>
  print()
## [1] 1.480672
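
The score can be double-checked with plain arithmetic, plugging in the rounded summary statistics printed above (the values are hard-coded so the snippet stands alone):

```r
# direct check of the z-score, using the rounded summary statistics above
n_L <- 252; mean_L <- 46.74135; sd_L <- 5.230683
n_H <- 254; mean_H <- 47.46291; sd_H <- 5.722151

z <- abs(mean_L - mean_H) / sqrt(sd_L^2 / n_L + sd_H^2 / n_H)
z # approximately 1.4807, matching the pipeline above
```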

There are two ways to evaluate how this score relates to our hypotheses:

  • calculate the p-value associated with the z-score and compare it with the significance level (\(\alpha\))
  • calculate the z-score associated with the significance level (\(\alpha\)) and compare it with the previously acquired z-score

The p-value is calculated by summing the areas under the normal distribution curve over the following intervals:

  • \(Z \le -1.480672\)
  • \(Z \ge 1.480672\)
data.frame(z = c(-4, 4)) |>
  ggplot(aes(z)) +
  # normal curve
  stat_function(fun = dnorm, lwd = 1) +
  stat_function(fun = dnorm, geom = "area", fill = "lightgrey", lwd = 1) +
  # rejection area fractions
  stat_function(
    fun = dnorm,
    xlim = c(-4, qnorm(.025)),
    geom = "area",
    fill = "red2"
  ) +
  stat_function(
    fun = dnorm,
    xlim = c(qnorm(.975), 4),
    geom = "area",
    fill = "red2"
  ) +
  ## corresponding z-scores
  geom_vline(xintercept = qnorm(.025), col = "red2") +
  geom_label(x = qnorm(.025), y = .15, label = qnorm(.025) |> round(4), col = "red2") +
  geom_vline(xintercept = qnorm(.975), col = "red2") +
  geom_label(x = qnorm(.975), y = .15, label = qnorm(.975) |> round(4), col = "red2") +
  # p-value fractions
  stat_function(
    fun = dnorm,
    xlim = c(-4, -z.score),
    geom = "area",
    fill = "purple2",
    alpha = 0.5
  ) +
  stat_function(
    fun = dnorm,
    xlim = c(z.score, 4),
    geom = "area",
    fill = "purple2",
    alpha = 0.5
  ) +
  ## corresponding z-scores
  geom_vline(xintercept = -z.score, col = "purple2") +
  geom_label(x = -z.score, y = .35, label = -z.score |> round(4), col = "purple2") +
  geom_vline(xintercept = z.score, col = "purple2") +
  geom_label(x = z.score, y = .35, label = z.score |> round(4), col = "purple2")

According to the calculations and the visual, the absolute z-score associated with half of the p-value on each tail is less than the z-score associated with the significance level; that is, it stands outside the rejection area, which means the p-value exceeds the significance level.

  • \(|z_{p/2}| < |z_{\alpha/2}|\), that is \(1.481 < 1.96\)
  • \(p > \alpha\), that is \(0.139 > 0.05\)
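
Both quantities in the comparison above can be reproduced with one-liners (the z-score is hard-coded here so the snippet stands alone):

```r
# two-tailed p-value for the observed z-score (value taken from above)
z_score <- 1.480672
2 * pnorm(-abs(z_score)) # approximately 0.139

# critical z-score at the 5% significance level (two-sided)
qnorm(0.975) # approximately 1.96
```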

In that light, we can conclude that there is no statistically significant evidence to reject the null hypothesis; that is, the observed difference between the median home values of the two crime-rate groups can be attributed to chance.

Simply put, we do not reject \(H_0\), the null hypothesis.

T-test

The t-score can be calculated the same way as in the previous z-score equations; however, the p-value is obtained from a t-distribution with the previously determined degrees of freedom, \(n_L + n_H - 2\).

# t-score
t.score <- z.score
# degrees of freedom
val.degrees_of_freedom |>
  print()
## [1] 504

The p-value is calculated by summing the areas under the t-distribution curve over the following intervals:

  • \(T \le -1.480672\)
  • \(T \ge 1.480672\)

As mentioned before, a t-distribution with that many degrees of freedom closely approximates a normal distribution.
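
The closeness can also be checked numerically, by comparing the two critical values at \(\alpha = 0.05\):

```r
# two-sided critical values at alpha = 0.05: normal vs. t with 504 df
qnorm(0.975)        # approximately 1.9600
qt(0.975, df = 504) # approximately 1.9647
```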

data.frame(t = c(-4, 4)) |>
  ggplot(aes(t)) +
  # t curve
  stat_function(fun = dt, args = list(df = val.degrees_of_freedom), lwd = 1) +
  stat_function(fun = dt, args = list(df = val.degrees_of_freedom), geom = "area", fill = "lightgrey", lwd = 1) +
  # rejection area fractions
  stat_function(
    fun = dt,
    args = list(df = val.degrees_of_freedom),
    xlim = c(-4, qt(.025, df = val.degrees_of_freedom)),
    geom = "area",
    fill = "red2"
  ) +
  stat_function(
    fun = dt,
    args = list(df = val.degrees_of_freedom),
    xlim = c(qt(.975, df = val.degrees_of_freedom), 4),
    geom = "area",
    fill = "red2"
  ) +
  ## corresponding t-scores
  geom_vline(xintercept = qt(.025, df = val.degrees_of_freedom), col = "red2") +
  geom_label(x = qt(.025, df = val.degrees_of_freedom), y = .15, label = qt(.025, df = val.degrees_of_freedom) |> round(4), col = "red2") +
  geom_vline(xintercept = qt(.975, df = val.degrees_of_freedom), col = "red2") +
  geom_label(x = qt(.975, df = val.degrees_of_freedom), y = .15, label = qt(.975, df = val.degrees_of_freedom) |> round(4), col = "red2") +
  # p-value fractions
  stat_function(
    fun = dt,
    args = list(df = val.degrees_of_freedom),
    xlim = c(-4, -t.score),
    geom = "area",
    fill = "purple2",
    alpha = 0.5
  ) +
  stat_function(
    fun = dt,
    args = list(df = val.degrees_of_freedom),
    xlim = c(t.score, 4),
    geom = "area",
    fill = "purple2",
    alpha = 0.5
  ) +
  ## corresponding t-scores
  geom_vline(xintercept = -t.score, col = "purple2") +
  geom_label(x = -t.score, y = .35, label = -t.score |> round(4), col = "purple2") +
  geom_vline(xintercept = t.score, col = "purple2") +
  geom_label(x = t.score, y = .35, label = t.score |> round(4), col = "purple2")

A slight difference can be observed in the t-score that bounds the rejection area: it is a bit farther from zero than the corresponding z-score, because the t-distribution's tails are slightly fatter.

  • \(|t_{p/2}| < |t_{\alpha/2}|\), that is \(1.4807 < 1.9647\)
  • \(p > \alpha\), that is \(0.1391 > 0.05\)
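
As with the z-test, the t-based p-value can be reproduced directly (the t-score is hard-coded so the snippet stands alone):

```r
# two-tailed p-value from the t-distribution with 504 degrees of freedom
t_score <- 1.480672
2 * pt(-abs(t_score), df = 504) # approximately 0.139
```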

Verification

Verifying the above calculations by running stats::t.test.

# stats::t.test
t.test(
  formula = median_home_value ~ crime_category,
  data = df.housing.imputed.refined,
  alternative = "two.sided",
  mu = 0,
  conf.level = 0.95,
  var.equal = TRUE
)
## 
##  Two Sample t-test
## 
## data:  median_home_value by crime_category
## t = 1.4801, df = 504, p-value = 0.1395
## alternative hypothesis: true difference in means between group High and group Low is not equal to 0
## 95 percent confidence interval:
##  -0.2362075  1.6793308
## sample estimates:
## mean in group High  mean in group Low 
##           47.46291           46.74135
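
Beyond reading the printed report, the components of the returned htest object can be accessed programmatically. The snippet below is a sketch on simulated data with roughly the same group sizes, means, and standard deviations, since the project data frame is not reproduced here:

```r
# sketch: accessing t-test results programmatically (illustrative data)
set.seed(42)
low  <- rnorm(252, mean = 46.74, sd = 5.23)
high <- rnorm(254, mean = 47.46, sd = 5.72)

res <- t.test(high, low, alternative = "two.sided", var.equal = TRUE)
res$statistic # t-score
res$parameter # degrees of freedom (252 + 254 - 2 = 504)
res$p.value   # p-value
```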

Linear Regression

Another task of the project is to perform a regression, to predict median home value based on a suitable predictor.

It has been shown that the target variable correlates with the average-rooms variable. Since the two seem to align in a linear fashion, a linear regression suits the needs.

# linear regression
df.housing.imputed.refined |>
  ggplot(aes(average_rooms, median_home_value)) +
  geom_point(
    shape = 21,
    size = 2.5,
    color = "cornflowerblue"
  ) +
  geom_smooth(
    formula = y ~ x,
    method = "lm",
    se = FALSE,
    col = "coral",
    fullrange = TRUE
  ) +
  scale_x_continuous(
    limits = c(0, 12),
    breaks = 0:12
  ) +
  scale_y_continuous(
    breaks = seq(0, 100, by = 5)
  ) +
  labs(
    title = "Linear regression scatterplot of Average Rooms vs Median Home Value",
    x = "Average Rooms",
    y = "Median Home Value"
  )

# linear regression model

mdl.linear_regression <- lm(
  formula = median_home_value ~ average_rooms,
  data = df.housing.imputed.refined
)

mdl.linear_regression |>
  summary()
## 
## Call:
## lm(formula = median_home_value ~ average_rooms, data = df.housing.imputed.refined)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6212 -1.6873  0.0096  1.7347 13.9495 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.7854     1.0944   3.459 0.000588 ***
## average_rooms   7.1886     0.1805  39.823  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.698 on 504 degrees of freedom
## Multiple R-squared:  0.7588, Adjusted R-squared:  0.7584 
## F-statistic:  1586 on 1 and 504 DF,  p-value: < 2.2e-16

According to the linear regression model, the regression line intercepts the ordinate at \(3.7854\) and has a slope of \(7.1886\), from which the following regression equation can be drawn:

\[ \begin{split} \hat{y}(x) = \beta_0 + \beta_1\times{x} \\ \hat{y}(x) = 3.7854 + 7.1886\times{x} \end{split} \]
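
Plugging a value into the fitted equation gives a point prediction; for example, for a hypothetical housing area averaging 6 rooms (the coefficients are the rounded estimates above — in practice, stats::predict on the fitted lm object would be used instead):

```r
# point prediction from the fitted regression equation (rounded coefficients)
beta_0 <- 3.7854
beta_1 <- 7.1886
predict_home_value <- function(average_rooms) beta_0 + beta_1 * average_rooms

predict_home_value(6) # 46.917
```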

With the missing values replaced, the coefficient of determination (\(R^2\)) indicates that around \(75\%\) of the variation of the dependent variable (median home value) can be accounted for by the independent variable (average rooms).

This linear regression model indicates that, on average, each additional room is associated with a \(7.1886\)-unit increase in the median home value, that is, \(\$7188.6\) with home values expressed in thousands of dollars.